Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Support batch export models as views #23052

Merged
merged 49 commits into from
Jun 24, 2024

Conversation

tomasfarias
Copy link
Contributor

@tomasfarias tomasfarias commented Jun 18, 2024

Problem

This PR sets the ground work to support person batch exports by introducing a set of parametrized views that will serve as the basis of batch exports moving forward. This new set of views includes a view for the persons model. Using parametrized views over the old approach has the benefit of being easier to parse the query logic when it's laid out in plain SQL (ideally this should be SQL files, but our migration system works with Python files...).

Changes

  • Introduce new parametrized views for person and events batch exports.

    • The person model is completely new, so no impact on existing functionality. This PR only introduces the view of the model, a follow-up PR will make use of it.
    • On the other hand, the event model is changing slightly with the new view:
      • We are now using GROUP BY instead of DISTINCT. This should perform better and use less memory (in particular with optimize_aggregation_in_order enabled).
      • We are now using a PREWHERE clause for filtering first on inserted_at. Once again, this should result in better performance.
      • Event model is divided into 3: "default", "unbounded", and "backfill", with one view for each. Determined by how they use timestamp and inserted_at filters. Should behave the same as before, but laid out in plain SQL instead of composed via Python formatting.
      • For legacy reasons, event model is different according to destination. Eventually, I'll create views for each destination to also lay them out in SQL
    • Using the views required refactoring large portions of code.
  • Introduce new methods to iterate over records asynchronously.

    • This requires implementing parsing of pyarrow.RecordBatch.
    • Will initially only use these to iterate over records of new models (i.e. persons), and switch to events on a later date.

👉 Stay up-to-date with PostHog coding conventions for a smoother review.

Does this work well for both Cloud and self-hosted?

How did you test this code?

  • Updated a few tests to work with views.
  • Ran all unit tests.

@tomasfarias tomasfarias requested a review from fuziontech as a code owner June 18, 2024 15:02
Comment on lines 143 to 129
@pytest_asyncio.fixture(scope="module", autouse=True)
async def create_batch_export_views(clickhouse_client, django_db_setup):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently clickhouse migrations don't really run in tests.

@@ -1,6 +1,5 @@
import datetime as dt
import json
import uuid
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every change on this file is just cleaning up.

@@ -357,6 +359,23 @@ def stream_query_as_arrow(
with pa.ipc.open_stream(pa.PythonFile(response.raw)) as reader:
yield from reader

async def astream_query_as_arrow(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in use yet, will use for person model exports.

# without battle testing it first.
# There are already changes going out to the queries themselves that will impact events in a
# positive way. So, we can come back later and drop this block.
for record_batch in iter_records(client, team_id=team_id, is_backfill=is_backfill, **parameters):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See TODO above: For now, keeping old functionality intact for safety. Will release person model first, test, and then switch events.

@tomasfarias tomasfarias force-pushed the feat/add-schema-selector-to-batch-exports-ui branch from a2744da to 4f182c7 Compare June 18, 2024 15:22
@@ -77,78 +77,20 @@ def get_timestamp_field(is_backfill: bool) -> str:
return timestamp_field


async def get_rows_count(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't count rows anymore, so this function is unused. Cleaning up.

Base automatically changed from feat/add-schema-selector-to-batch-exports-ui to master June 18, 2024 18:04
@tomasfarias tomasfarias force-pushed the refactor/support-for-batch-export-model-views branch from ecb38cb to dc9535f Compare June 19, 2024 08:57
Comment on lines 51 to 53
PREWHERE
COALESCE(events.inserted_at, events._timestamp) >= {interval_start:DateTime64}
AND COALESCE(events.inserted_at, events._timestamp) < {interval_end:DateTime64}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😎

At this point, inserted_at should always be set for all batch
exports. Only historical exports require _timestamp, but backfills
have already been switched over to query based on timestamp, so they
also do not need to check for inserted_at/_timestamp.

Removing the colaesce and using only inserted_at reduces the size
of the data CH has to fetch by half.
@tomasfarias tomasfarias force-pushed the refactor/support-for-batch-export-model-views branch from 6769736 to 2eabe0c Compare June 21, 2024 15:27
Copy link
Member

@fuziontech fuziontech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great. I'm excited to use Views here...easy to iterate and low risk.

)
FORMAT ArrowStream
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yield record_batch
return

async for record_batch in client.astream_query_as_arrow(view, query_parameters=parameters):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice.

@tomasfarias tomasfarias merged commit ca0bf0b into master Jun 24, 2024
85 checks passed
@tomasfarias tomasfarias deleted the refactor/support-for-batch-export-model-views branch June 24, 2024 15:09
timgl pushed a commit that referenced this pull request Jun 27, 2024
* refactor: Update metrics to fetch counts at request time

* fix: Move import to method

* fix: Add function

* feat: Custom schemas for batch exports

* feat: Frontend support for model field

* fix: Clean-up

* fix: Add missing migration

* fix: Make new field nullable

* Update UI snapshots for `chromium` (1)

* Update UI snapshots for `chromium` (1)

* Update UI snapshots for `chromium` (1)

* Update UI snapshots for `chromium` (1)

* Update UI snapshots for `chromium` (1)

* Update UI snapshots for `chromium` (1)

* fix: Bump migration number

* fix: Bump migration number

* refactor: Update metrics to fetch counts at request time

* fix: Actually use include and exclude events

* refactor: Switch to counting runs

* refactor: Support batch export models as views

* fix: Merge conflict

* fix: Quality check fixes

* refactor: Update metrics to fetch counts at request time

* fix: Move import to method

* fix: Add function

* fix: Typing fixes

* feat: Custom schemas for batch exports

* feat: Frontend support for model field

* fix: Clean-up

* fix: Add missing migration

* fix: Make new field nullable

* Update UI snapshots for `chromium` (1)

* Update UI snapshots for `chromium` (1)

* Update UI snapshots for `chromium` (1)

* Update UI snapshots for `chromium` (1)

* Update UI snapshots for `chromium` (1)

* Update UI snapshots for `chromium` (1)

* fix: Bump migration number

* fix: Clean-up unused code

* chore: Clean-up unused function and tests

* fix: Clean-up unused function

* fix: HTTP Batch export default fields

* fix: Remove test case on new column not present in base table

* chore: Clean-up unused functions and queries

* fix: Only run extra clickhouse queries in batch exports tests

* refactor: Remove coalesce and use only inserted_at in queries

At this point, inserted_at should always be set for all batch
exports. Only historical exports require _timestamp, but backfills
have already been switched over to query based on timestamp, so they
also do not need to check for inserted_at/_timestamp.

Removing the colaesce and using only inserted_at reduces the size
of the data CH has to fetch by half.

* fix: Remove deprecated test

* fix: Add person_id to person model and enforce ordering

* refactor: Also add version column

---------

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants